Abstract: Particle swarm optimization (PSO) has been 
employed on several optimization problems, including 
the clustering problem. PSO has also been employed in 
the clustering of data of different structure and 
dimensionality. In this paper it is employed in the 
clustering of nucleic acid sequences. The application of 
clustering, as a statistical tool, in the analysis of data of 
varied complexity has been treated by several 
researchers. Besides PSO, distance-based algorithms 
have been widely proposed for the clustering problem. 
This paper investigates the efficiency of PSO clustering 
on nucleic acid sequences through the introduction of 
distance measures among which are the Euclidean 
distance measure, Manhattan distance, edit distance 
and the codon-based scoring method (COBASM). Sub- 
objective weights were introduced to observe the 
behaviour of PSO under various conditions. From the 
result obtained, PSO-based clustering produces 
compact and well-separated clusters. However, the 
result varied with distance measure. 
I. INTRODUCTION 

Clustering, as an important aspect of knowledge 
discovery, has as its main aim to group related 
elements based on some predefined measure of 
closeness or proximity. Clustering involves the 
discovery of relationships in data without the 
application of any prior knowledge of the 
relationships. The final result of clustering depends 
on the perception of the user through the application 
of some subjective decisions. These decisions are 
(1) the definition and measurement of the 
relationships between the data elements that would 
warrant clustering, (2) the actual number of clusters 
expected in the clustering task, and (3) the 
representation of the generated clusters. Most conventional clustering algorithms employ distance or similarity measures to determine object proximity and to generate clusters [1]. 



Clustering, in computational biology, goes 
beyond a mere statistical tool for information 
retrieval. It actually reveals the genetic information 
of participating sequences. Such information helps 
in the determination of gene families and the 
establishment of implicit links between them. 
Clustering of biological sequence data presents a great challenge to the computing community as well as to biologists. This challenge arises from the fact that sequence data cannot be easily clustered by the application of conventional distance or similarity measures. Also, string edit distance algorithms employed in string comparisons and string similarity searches are mostly not suitable for biological sequence data clustering [2], because the structural nature of biological sequences makes the string edit distance inappropriate. For example, the edit distance between the strings bbbbbbbddddddd and dddddddbbbbbbb suggests that there is no similarity between the strings. Viewed biologically, however, the strings share an element of structural similarity which the edit distance neglects. Since structural similarity is a major issue in biological sequence analysis, the edit distance and other distance-based algorithms on their own are not well suited to clustering biological sequences. 

The introduction of particle swarm optimization (PSO) becomes necessary at this point, since it has been shown to be robust in handling optimization problems [3]. Distance measures therefore have to be used with the PSO-based clustering method to observe their performance under various conditions. Since PSO has already been successfully applied to data clustering and image segmentation [4], [3], this paper investigates the efficiency of the PSO-based clustering method in clustering nucleic acid sequences with respect to the distance measures. The measures used are the Euclidean distance, the Manhattan distance, and the edit distance. The codon-based scoring method (COBASM) [5] is also used with PSO in the clustering of nucleic acid sequences. COBASM considers the application of codons¹ to maintain the structural similarity of sequences. 

The remainder of the paper is organized as follows: 
Section II presents related work, Section III discusses 
particle swarm optimization, Section IV describes 
distance measures employed in the PSO clustering 
task, Section V is devoted to the experimental results 
obtained with the PSO-based sequence clustering, 
and Section VI presents the conclusion, and 
directions for further research. 

II. RELATED WORK 

Several methods have been proposed for data clustering tasks [6]. These methods fall into two broad categories: hierarchical and partitional. One of the most heavily researched partitional algorithms is the K-means algorithm, an iterative partitional approach to data clustering [7]. K-means is popular, but it is most often criticized for requiring the number of clusters to be specified a priori. It is nevertheless simple, easy to implement, and has linear time complexity. 

The fuzzy C-means (FCM) is a clustering method that introduces a fuzzy version of K-means [8], [9]. Although FCM still requires the value of K to be provided a priori, it outperforms K-means in that it is less affected by the presence of uncertainty in the data [10]. 

The K-harmonic means algorithm computes the 
harmonic means of each cluster centre to every 
pattern and then updates the cluster centroids 
accordingly [11]. The K-harmonic means is less 
affected by the initial conditions. Experimental results 
show that it outperforms the FCM and K-means [12]. 

Yang and Wang [2] proposed CLUSEQ for the clustering of sequences based on sequence structural features and exhibited statistical properties. CLUSEQ builds a probabilistic suffix tree in the initialization of the sequences. Although this method seems better than most sequence clustering methods, CLUSEQ does not consider that some sequences can exhibit closer similarity than others depending on whether the sequences are amino acids or nucleic acids [13]. 

¹ A codon is simply a tri-nucleotide sequence (a triplet of the bases A, C, G, and U or T, denoting Adenine, Cytosine, Guanine, Uracil and Thymine, respectively) that is used to identify or specify an amino acid. 

Most clustering methods employ distance measures to determine the proximity of data elements. Some of these distance/similarity measures are discussed in Section IV. The edit distance, originally designed for similarity search, is also employed in clustering tasks. It has been shown that the edit distance lacks the ability to handle sequences based on their structural similarities [2]. Muthukrishnan and Sahinalp [14] proposed the edit distance with block operations in an attempt to improve the edit distance's performance. To further improve the efficiency of the edit distance, Cormode and Muthukrishnan [15] introduced a greedy algorithm that reduces moves of substrings to moves of characters and converts moves of characters to only inserts and deletes. 

In the same vein, Lopresti and Tomkins [16] 
proposed block edit models for approximate string 
matching, which could be extended to sequence 
clustering, by examining string edit distance in which 
two strings are compared by extracting collections of 
substrings and placing the two strings into 
correspondence with each other. 

III. PARTICLE SWARM OPTIMIZATION 

Particle swarm optimization (PSO) is derived from the social behaviour of, and the implicit rules adhered to by, birds in a flock that enable them to move synchronously without colliding [17]. The belief that social sharing of information by members of a population may provide an evolutionary advantage was the basic idea behind the development of PSO [18]. Naturally, our problems are sometimes solved through our interaction with one another. This interaction produces socio-cognitive experience which ultimately affects our behaviours and attitudes, otherwise referred to as the social and cognitive components. The cognitive component represents the particle's own experience as to where the best solution is, while the social component represents the belief of the entire swarm as to where the best solution is. PSO simulates this idea of social optimization, where social organisms tend to move in the direction of optimal benefit. 

The two early variants of the PSO algorithm are referred to as the gbest (global best) PSO and the lbest (local best) PSO. The particles (a swarm of individuals) in the gbest PSO move toward their best previous positions and toward the best particle in the entire swarm. In the lbest PSO each particle moves toward its best previous position and toward the best particle in its restricted neighbourhood [19]. The gbest PSO has been employed in unsupervised image classification and is considered efficient in cluster analysis in comparison to the lbest PSO [3]. The personal best position, $y_i$, of particle $i$ is the best position the particle has ever visited, that is, the position that resulted in the best fitness value. With $f$ representing the fitness function, the personal best position of particle $i$ is updated at time step $t$ as: 

$$ y_i(t+1) = \begin{cases} y_i(t) & \text{if } f(x_i(t+1)) \ge f(y_i(t)) \\ x_i(t+1) & \text{if } f(x_i(t+1)) < f(y_i(t)) \end{cases} $$

The current position of particle $i$ is denoted by $x_i$. The velocity of particle $i$ for the lbest PSO is calculated as in Equation (3). For the gbest PSO, $\hat{y}_{ij} = \hat{y}_j$ for all $i = 1, \ldots, n_s$ (the total size of the swarm), where $\hat{y}_i$ is the neighbourhood best position of the particle and $\hat{y}$ is the position of the global best particle. 

$$ v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1j}(t)\,[y_{ij}(t) - x_{ij}(t)] + c_2 r_{2j}(t)\,[\hat{y}_{ij}(t) - x_{ij}(t)] \tag{3} $$

where $v_{ij}(t)$, $y_{ij}(t)$ and $x_{ij}(t)$ are the velocity, the personal best position and the current position, respectively, of particle $i$ in an $N_d$-dimensional swarm, $P$, for $j = 1, \ldots, N_d$ at time step $t$; $c_1$ and $c_2$ are positive acceleration constants used to scale the contribution of the cognitive and social components, respectively; and $r_{1j}(t), r_{2j}(t) \sim U(0, 1)$ are random values in the range [0, 1]. The velocity from Equation (3) is used to update the particle's position at every iteration. 
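
The update rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the default values of `c1` and `c2`, and the position update `x_i(t+1) = x_i(t) + v_i(t+1)` (the standard PSO step, not written out in the text) are assumptions made here, and fitness is assumed to be minimized.

```python
import random

def update_personal_best(y_i, x_new, f):
    """Keep the old personal best unless the new position has strictly lower fitness."""
    return list(x_new) if f(x_new) < f(y_i) else list(y_i)

def update_velocity(v_i, x_i, y_i, y_hat, c1=1.49, c2=1.49, rng=random):
    """Per-dimension velocity update in the form of Equation (3).

    c1 and c2 are placeholder values chosen for illustration only.
    """
    new_v = []
    for j in range(len(x_i)):
        r1 = rng.random()  # r_1j(t) ~ U(0, 1)
        r2 = rng.random()  # r_2j(t) ~ U(0, 1)
        cognitive = c1 * r1 * (y_i[j] - x_i[j])   # pull toward the personal best
        social = c2 * r2 * (y_hat[j] - x_i[j])    # pull toward the swarm/neighbourhood best
        new_v.append(v_i[j] + cognitive + social)
    return new_v

def update_position(x_i, v_new):
    """Assumed standard position update: x_i(t+1) = x_i(t) + v_i(t+1)."""
    return [x + v for x, v in zip(x_i, v_new)]

# One illustrative step on a toy fitness function (sum of squares).
sphere = lambda x: sum(c * c for c in x)
x, v = [2.0, -1.0], [0.0, 0.0]
y, y_hat = list(x), [0.0, 0.0]
v = update_velocity(v, x, y, y_hat, rng=random.Random(0))
x = update_position(x, v)
y = update_personal_best(y, x, sphere)
print(x, y)
```

Because the personal best is replaced only on a strict improvement, its fitness never gets worse from one iteration to the next.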

A. PSO Clustering Method 



PSO has been used by Van der Merwe and 
Engelbrecht [4] to cluster sets of multidimensional 
data using a fitness function consisting of quantization 
error only. In general, the results show that the PSO- 
based clustering algorithm performs better than the K- 
means algorithm. PSO is more likely to find near- 
optimal solutions than K-means. This is because, 
whereas PSO is less sensitive to the effect of the 
initial conditions owing to its population-based nature, 
K-means, as a greedy algorithm, depends on the 
initial conditions. 

PSO-based clustering has also been used by 
Omran [3] in the clustering of image pixels. In his 
work, several versions of PSO were examined. The 
gbest PSO was found to outperform most of the other 
versions on most data sets. 

Tillett et al. [20] employed PSO in the clustering of 
sensors in a sensor network. When the PSO 
technique was tested against random search and 
simulated annealing, it was found to be more robust. 

PSO has also been applied in document clustering 
[21]. Cui et al. demonstrated that the hybrid PSO 
algorithm employed in the task of document clustering 
was able to generate more compact clusters in 
comparison to the K-means algorithm. 

Xiao et al. [22] performed gene clustering by proposing the application of the Self-Organizing Map (SOM) and PSO. SOM and PSO were each applied independently to gene clustering, and the result obtained when both methods were used together was better than the results of the individual methods. 

B. PSO-based Clustering Algorithm 

In this paper PSO-based clustering is employed in the clustering of nucleic acid sequence data, with minor modifications on the data type. Several nucleotides combine to form a nucleic acid sequence; such sequences are referred to in this paper as patterns. Each sequence represents a particle (a candidate solution) in the swarm. Patterns identify particles, and a single particle represents the cluster centroids of the individual clusters. To measure the fitness of each particle, Equation (4) was used: 

$$ f = w_1\, Int_{max} + w_2\,(n_{max} - Int_{min}) \tag{4} $$

where $Int_{max}$ and $Int_{min}$ are, respectively, the intra-cluster and inter-cluster distances, $w_1$ and $w_2$ are user-defined constants that specify how much the intra-cluster and inter-cluster distances, respectively, contribute to the final fitness, and $n_{max}$ is the maximum value in the data set (values lie between 0 and 5 in this paper, i.e. $n_{max} = 4$). The intra-cluster and inter-cluster distances are measured by calculating the maximum and minimum average distance within and between the clusters, respectively [3], and are given as 



$$ Int_{max} = \max_{k=1,\ldots,K} \left\{ \frac{1}{m_k} \sum_{s_i \in S_k} d(s_i, c_k) \right\} \tag{5} $$

and 

$$ Int_{min} = \min_{\forall k, l,\; k \ne l} \left\{ d(c_k, c_l) \right\} \tag{6} $$

where $S_k$ is the $k$th cluster, $s_i$ is the $i$th sequence in cluster $S_k$, $c_k$ is the centroid of $S_k$, $m_k$ is the number of sequences in $S_k$, and $K$ is the number of clusters formed for the clustering problem. The notation $d(x, y)$ is used in Equations (5), (6) and (7) to denote the distance between the properties $x$ and $y$. The quantization error function is employed to determine the quality of the clustering and is defined as: 

$$ Q_e = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{m_k} \sum_{s_i \in S_k} d(s_i, c_k) \right] \tag{7} $$

In summary, the PSO clustering algorithm is given in Figure 1. 

Initialize each sequence to contain the cluster centroids c_k; 
for t = 1 to t_max do 
    for each sequence (s_i) 
        (i) calculate the distance d(s_i, c_k) to the centroid c_k of every cluster S_k 
        (ii) allocate sequence s_i to the cluster S_k for which 
             d(s_i, c_k) = min_{k=1,...,K} {d(s_i, c_k)} 
        (iii) calculate the fitness using Equation (4) 
    Update the pbest position and the gbest solution. 

Fig. 1. The PSO clustering algorithm 
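
Steps (i) to (iii) of the algorithm can be illustrated with a short Python sketch. It is a simplified reading of the definitions in this section, not the code used for the experiments. The function names are hypothetical, empty clusters are skipped when averaging (an assumption made here), and the fitness is taken to be `w1 * Int_max + w2 * (n_max - Int_min)`, the form used in [3], which is assumed to be what Equation (4) denotes.

```python
def assign_to_clusters(sequences, centroids, d):
    """Steps (i)-(ii): put each sequence in the cluster with the nearest centroid."""
    clusters = [[] for _ in centroids]
    for s in sequences:
        k = min(range(len(centroids)), key=lambda idx: d(s, centroids[idx]))
        clusters[k].append(s)
    return clusters

def mean_distances(clusters, centroids, d):
    """Average distance to the centroid, for every non-empty cluster."""
    return [sum(d(s, c) for s in members) / len(members)
            for members, c in zip(clusters, centroids) if members]

def int_max(clusters, centroids, d):
    """Intra-cluster distance: largest average distance within a cluster."""
    return max(mean_distances(clusters, centroids, d))

def int_min(centroids, d):
    """Inter-cluster distance: smallest distance between two distinct centroids."""
    return min(d(centroids[a], centroids[b])
               for a in range(len(centroids))
               for b in range(a + 1, len(centroids)))

def quantization_error(clusters, centroids, d):
    """Mean of the per-cluster average distances (non-empty clusters only)."""
    means = mean_distances(clusters, centroids, d)
    return sum(means) / len(means)

def fitness(clusters, centroids, d, w1=0.5, w2=0.5, n_max=4):
    """Step (iii): assumed fitness form; lower values indicate better clusterings."""
    return w1 * int_max(clusters, centroids, d) + w2 * (n_max - int_min(centroids, d))

# Toy example on numerically encoded triplets with a Manhattan distance.
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
seqs = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 3, 1)]
cents = [(1, 1, 1), (2, 2, 1)]
groups = assign_to_clusters(seqs, cents, manhattan)
print(groups, fitness(groups, cents, manhattan))
```

Passing the distance function `d` as a parameter mirrors the way the paper swaps distance measures while keeping the rest of the PSO procedure unchanged.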



IV. DISTANCE MEASURES 



This section examines distance/similarity measures 
employed in this paper in the clustering of nucleic acid 
sequences. Most clustering tasks are performed based 
on some similarity or dissimilarity measures. Distance 
or similarity measures are mathematical 
representations of closeness or similarity. The selection of a distance measure for clustering is an important task because it influences the shape of the clusters: some patterns may be close to one another according to one distance measure and farther apart according to another. This effect was observed with the distance measures described below. 

A. Euclidean Distance 

The most widely used distance measures are the Euclidean distance and the squared Euclidean distance. The Minkowski metric, from which the Euclidean distance is derived, is defined as 

$$ d_{\beta}(S_1, S_2) = \left( \sum_{j=1}^{N_d} |S_{1j} - S_{2j}|^{\beta} \right)^{1/\beta} $$

where $N_d$ is the number of variables, and $S_{1j}$ and $S_{2j}$ are the values of the $j$th variable at points $S_1$ and $S_2$, respectively. The Euclidean distance is the special case of the Minkowski metric where $\beta = 2$ [23]. The Euclidean distance tends to form hyper-spherical clusters [23]. The squared Euclidean distance metric uses the same equation as the Euclidean distance metric, but without the square root. This makes clustering with the squared Euclidean distance metric faster than with the regular Euclidean distance. 
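
The relationship between the two measures can be sketched as follows. The helper names are hypothetical and the data are made up for illustration. Because the square root is monotonic, dropping it does not change which centroid is nearest, which is why the squared form can stand in for the regular one during cluster assignment.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Regular Euclidean distance: square root of the squared form."""
    return math.sqrt(squared_euclidean(a, b))

def nearest(point, centroids, dist):
    """Index of the centroid closest to `point` under the given distance."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))

point = (1, 3, 2)
centroids = [(1, 1, 1), (4, 4, 3)]
# Both measures select the same centroid for this point.
print(nearest(point, centroids, euclidean), nearest(point, centroids, squared_euclidean))
```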



B. Edit Distance 

The edit distance (also called the Levenshtein distance) is another distance measure, developed by Levenshtein [24] and employed in sequence similarity search. The edit distance is a generalization of the Hamming distance. It is used in DNA sequence analysis, plagiarism detection, speech recognition, and spell checking [25]. The edit distance is the minimum number of edit operations (insertions, deletions and substitutions) needed to transform one sequence into another. For two sequences $S_1[1..i]$ and $S_2[1..j]$, the edit distance (ED) between $S_1$ and $S_2$ (denoted by $d(i, j)$) is defined as 

$$ d(i, j) = \min\{\, d(i-1, j) + 1,\; d(i, j-1) + 1,\; d(i-1, j-1) + t(i, j) \,\}, \quad \text{where } t(i, j) = \begin{cases} 1 & \text{if } S_1[i] \ne S_2[j] \\ 0 & \text{if } S_1[i] = S_2[j] \end{cases} \tag{10} $$

with $d(i, 0) = i$ and $d(0, j) = j$. The value $d(i, j)$ is, therefore, the minimum number of edit operations needed to transform the first $i$ characters of $S_1$ into the first $j$ characters of $S_2$. Using the algorithm in Figure 2, the edit distance $d(i, j)$ is calculated using a bottom-up dynamic programming approach, as is common to most string algorithms [26]. From the algorithm, if the lengths of $S_1$ and $S_2$ are denoted by $n$ and $m$, respectively, the edit distance between the two sequences is the value $d(n, m)$, obtained by computing $d(i, j)$ for all combinations of $i$ and $j$, for $0 \le i \le n$ and $0 \le j \le m$. 

The edit distance is simple and easy to implement. 
However, it has the following disadvantages: 

The edit distance has an order of mn time 
and space complexity (O(mn)), which makes it rather 
too slow when the dataset is large. 

It parallelizes poorly as a result of large data 
dependencies. 
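
The bottom-up dynamic programme described in this section can be written as the following Python sketch. It follows the standard Levenshtein recurrence with unit costs; the function name is chosen here for illustration. The full table makes the O(mn) time and space cost visible.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t using a full (m+1) x (n+1) table."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i characters of s
    for j in range(n + 1):
        d[0][j] = j  # insert the first j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(edit_distance("bbbbbbbddddddd", "dddddddbbbbbbb"))
```

For the two strings used as an example in the Introduction, the distance is 14, the largest value possible for two strings of length 14, even though the strings consist of the same two blocks in swapped order.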



Fig. 2. Edit Distance Algorithm. 

C. The Codon-based Scoring Method 

The codon-based scoring method (COBASM) [27], [5] takes an entire source sequence and compares each character with the target in the same way the edit distance does. However, instead of scoring mismatches, COBASM scores matches. Where the characters compared match, COBASM scores 1 per character, and 0 otherwise. If a consecutive block of three characters matches, an additional 1 is added to the score. This procedure continues until all the characters are compared. In order to capture all the codons in the target sequence, COBASM continues the search from the second position in the target sequence. The idea is to capture the principle governing the construction of the codon table used in the formation of the twenty amino acids found in proteins. 

Nucleic acid (DNA/RNA) sequences are only considered similar if the percentage similarity is at least 70% [13]. Therefore, the value obtained from COBASM must be at least 70% of the entire length of the source sequence before the sequence can be considered a member of the cluster. The algorithm is given in Figure 3. 
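
The scoring idea can be illustrated with a deliberately simplified Python sketch. It is not the COBASM algorithm of [5]: it assumes two equal-length, already aligned sequences, counts only non-overlapping blocks of three starting at the first position, and omits the shifted second pass and the handling of unequal lengths described above. The function names and the `threshold` parameter are hypothetical.

```python
def cobasm_like_score(source, target):
    """Simplified codon-aware match score for two equal-length, aligned sequences."""
    if len(source) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    # 1 point for every position where the characters match
    score = sum(1 for a, b in zip(source, target) if a == b)
    # 1 extra point for every non-overlapping block of three that matches in full
    for start in range(0, len(source) - 2, 3):
        if source[start:start + 3] == target[start:start + 3]:
            score += 1
    return score

def is_cluster_member(source, target, threshold=0.70):
    """Membership test: the score must reach 70% of the source length."""
    return cobasm_like_score(source, target) >= threshold * len(source)

print(cobasm_like_score("AUGGCC", "AUGGCA"), is_cluster_member("AUGGCC", "AUGGCA"))
```

In this example five characters match and the first codon (AUG) matches in full, so the score is 6, which exceeds 70% of the source length of 6.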

A sequence is a contiguous collection of nucleotide symbols. The symbols are A, C, G, T in DNA, with T replaced by U in RNA. In sequence clustering, data are represented in symbolic form and need to be converted to numeric form to implement PSO. To achieve this, the nucleotides are assigned numeric values as follows: A=1, C=2, G=3, U=T=4. The resultant sequence data can be interpreted as a series of events separated by intervals. A symbol (now represented in numeric form) is regarded as an event and a comma (,) as an interval. An event interval is, therefore, represented by a lower and an upper bound; for example, (1, 3, 2), with an interval between each pair of events in a 3-dimensional space, means AGC. A sequence of length 60 will have 60 events and 59 intervals, i.e. 60 dimensions. COBASM is simple to implement, and results indicate that it is more robust than the edit distance in the task of sequence clustering. In the experiment performed in this paper, when the Euclidean distance is replaced with COBASM in PSO-based sequence clustering, the result obtained shows a significant improvement over the other methods. 
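
The symbolic-to-numeric conversion described above amounts to a simple lookup. The sketch below uses the mapping given in the text; the names `NUCLEOTIDE_CODES` and `encode` are chosen here for illustration.

```python
# Mapping from the text: A=1, C=2, G=3, U=T=4
NUCLEOTIDE_CODES = {"A": 1, "C": 2, "G": 3, "U": 4, "T": 4}

def encode(sequence):
    """Convert a nucleotide string into a tuple of numeric events, one per dimension."""
    return tuple(NUCLEOTIDE_CODES[base] for base in sequence.upper())

print(encode("AGC"))  # the three-dimensional example from the text
```

A sequence of 60 nucleotides therefore becomes a 60-dimensional point, and DNA and RNA spellings of the same sequence (T versus U) map to the same point.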

It is proven by Baridam [5] that COBASM satisfies the condition for metrics. This justifies the usage of COBASM alongside other distance metrics in this paper. 

// Edit distance algorithm (Fig. 2) 
int ED(char s[1..m], char t[1..n]) 
    declare int d[0..m, 0..n]   // d is a table with m+1 rows and n+1 columns 

    for i = 0,...,m do 
        d[i, 0] = i 
    endfor 

    for j = 0,...,n do 
        d[0, j] = j 
    endfor 

    for i = 1,...,m do 
        for j = 1,...,n do 
            if s[i] = t[j] then 
                cost = 0 
            else 
                cost = 1 
            endif 
            // deletion, insertion and substitution 
            d[i, j] = minimum(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost) 
        endfor 
    endfor 
    return d[m, n] 

// Codon-based scoring method (Fig. 3) 
Initialize S1 and S2; 

for |S1|: i = 1 to n do 
    for |S2|: j = 1 to m do 
        // determine length of longest sequence 
        // if sequences are unaligned or unequal 
        if n < m    // length of sequence less than longest sequence 
            // do pattern-element search: 
            // compare s1[i] with s2[j], s2[j+1], ..., s2[m-n] and 
            // s1[i+1] with s2[j+1], s2[j+2], ..., s2[m-n+1] 
            if s1[i] = s2[j] 
                score = 1 
            else 
                score = 0 
            endif 
        endif 

        if n = m    // lengths of sequences are equal 
            if s1[i] = s2[j]    // examine each character of S1 and S2 
                score = 1 
            else 
                score = 0 
            endif 
        endif 

        // split sequences S1 and S2 (including gaps if aligned) into 
        // blocks of three nucleotides each and compare adjacent blocks 
        for i, j > 0 do 
            // do a total block-match 
            if s1[i+1, i+2, i+3] = s2[j+1, j+2, j+3] 
                score = score + 1 
            endif 
        endfor 
    endfor 
endfor 
return score 



D. Manhattan Distance 
The Manhattan distance metric is defined as: 

$$ d(S_1, S_2) = \sum_{j=1}^{N_d} |S_{1j} - S_{2j}| $$

where $N_d$ is the number of variables, and $S_{1j}$ and $S_{2j}$ are the values of the $j$th variable at points $S_1$ and $S_2$, respectively. 

The Manhattan distance is measured as the sum of 
the displacements along the vertical and horizontal 
axes. This implies that the Manhattan distance 
function computes the distance between points 
through a grid-like path. The Manhattan distance 
metric is poor with datasets of high dimensionality 
[28]. 
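
As a small worked illustration, consider the codons AGC = (1, 3, 2) and ACG = (1, 2, 3) under the encoding A=1, C=2, G=3, U=T=4 given earlier. The pair is chosen here purely for illustration and is not taken from the datasets.

```latex
\begin{align*}
d_{\text{Manhattan}}(\text{AGC}, \text{ACG}) &= |1-1| + |3-2| + |2-3| = 2, \\
d_{\text{Euclidean}}(\text{AGC}, \text{ACG}) &= \sqrt{(1-1)^2 + (3-2)^2 + (2-3)^2} = \sqrt{2} \approx 1.41 .
\end{align*}
```

Both values are determined by the numeric codes assigned to the bases, so a different assignment of codes would give different distances for the same pair of codons.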

V. EXPERIMENTAL RESULTS 

This section compares the results of applying different distance/similarity measures with the PSO clustering algorithm in the clustering of six sequence datasets. The distance measures are the Euclidean distance, the edit distance (ED), the Manhattan distance and COBASM. The six datasets were drawn from emblFasta Rickettsia typhi str. RNA sequences with Accession Number AE017197 from the Wilmington Complete Genome of 1111500 nucleotides, Homo sapiens' melanatonic melanoma DNA sequences, mRNA Bos taurus sequences from the Genetic Sequence Databank with Accession Number BE484664 obtained from the work of Sonstegard et al. [29], and DNA dental sequences from the Department of Microbiology, University of Pretoria, South Africa. 



Fig. 3. A pseudo-code for the codon-based scoring method 



The main purpose was to compare the quality of the clusters generated by each distance measure based on (1) the quantization error, $Q_e$, (2) the intra-cluster distances, $Int_{max}$, and (3) the inter-cluster distances, $Int_{min}$. The intra-cluster and inter-cluster distances define the degree of compactness and separability of the generated clusters. For all the results obtained, averages of 30 simulations over 100 iterations are reported with standard deviations to indicate the range of values to which the distance measures converge. 



TABLE I. PERFORMANCE COMPARISON 

The following data sets, of varying complexities, were employed. 

Dataset 1: 500 Rickettsia typhi str. RNA 
sequences consisting of 30000 nucleotides. 

Dataset 2: 200 Rickettsia typhi str. RNA 
sequences consisting of 12000 nucleotides. 

Dataset 3: 100 Rickettsia typhi str. RNA 
sequences consisting of 6000 nucleotides. 

Dataset 4: 31 DNA dental sequences of 
varying lengths consisting of approximately 12550 
nucleotides. 

Dataset 5: 20 Homo sapiens' melanatonic melanoma DNA sequences of varying lengths and a total of 15658 nucleotides, with the longest sequence 1471 and the shortest 134 nucleotides long. 

Dataset 6: 141 mRNA Bos taurus sequences of 29718 nucleotides, with the longest sequence 508 and the shortest 198 nucleotides long. Accession date: June 15, 2008. 

Table I summarizes the results obtained for each of the four distance measures. The influence of the sub-objective weights for the intra- and inter-cluster distances on the final fitness was investigated. To determine the quality of clusters generated using Equation (4), the weight pairs $(w_1, w_2)$ = (0.5, 0.5), (0.6, 0.4), (0.3, 0.7), (0.8, 0.2) and (0.1, 0.9) were employed. The values are chosen to ensure that the weights $w_1$ and $w_2$ sum to 1.0. The final results obtained from this parametric clustering are very much dependent on the number of iterations, hence the results in Table I. 
The results show a marked improvement in the quality, compactness and separability of the clusters generated with COBASM on virtually all the datasets, as indicated by the values in Table I. PSO also produced significant results when the other distance measures were employed, which shows the robustness of PSO-based sequence clustering. However, the Manhattan distance performed very poorly in all cases. This is consistent with the observation that the Manhattan distance measure handles high-dimensional data poorly [28]. 

For Dataset 1, the quality of clusters generated improved from 60.7711 with $w_1 = w_2 = 0.5$ to 24.7662 with the weights set to 0.3 and 0.7, respectively, with COBASM. The quality further improved with the weights set to 0.6 and 0.4, respectively, with all the distance measures. A significant result was obtained when the weights were set to 0.8 and 0.2, respectively. The results again became poor with the weights set to 0.1 and 0.9, respectively. From these results, it is clear that an increase in the value of $w_1$ produced better quality of generated clusters. These trends were observed on all the other datasets. The results obtained further demonstrate that numeric-based distance measures do not produce the best clustering results on nucleic acid sequences. 

VI. CONCLUSION AND FURTHER RESEARCH 

This paper investigated the performance of the PSO-based clustering method as applied to the clustering of nucleic acid sequences by introducing distance measures. The performances of three distance measures, namely the edit distance, the Manhattan distance and COBASM, were examined alongside the Euclidean distance as they were applied in the clustering of high-dimensional problems. Several sub-objective weights were used to observe the robustness of the method. PSO was found to perform best when COBASM was introduced in the clustering problem. The performance was evaluated based on the quality, compactness and separability of the formed clusters. The results demonstrate that numeric-based distance measures are not capable of producing quality clusters on nucleic acid sequences. 

This work can be extended by applying PSO 
with the codon-based scoring method in the 
clustering of amino acids (protein) sequences. In 
the experiment conducted in this paper, multi- 
dimensional problems were avoided by truncating 
the sequences to the nearest available dimension 
that could be handled by PSO clustering functions. 
An extension to multi-dimensional problems will be 
a novel contribution. 